
 delta table


ReeFRAME: Reeb Graph based Trajectory Analysis Framework to Capture Top-Down and Bottom-Up Patterns of Life

Gudavalli, Chandrakanth, Zhang, Bowen, Levenson, Connor, Lore, Kin Gwn, Manjunath, B. S.

arXiv.org Artificial Intelligence

In this paper, we present ReeFRAME, a scalable Reeb graph-based framework designed to analyze vast volumes of GPS-enabled human trajectory data generated at 1 Hz. ReeFRAME models Patterns-of-life (PoL) at both the population and individual levels, utilizing Multi-Agent Reeb Graphs (MARGs) for population-level patterns and Temporal Reeb Graphs (TERGs) for individual trajectories. The framework's linear algorithmic complexity relative to the number of time points ensures scalability for anomaly detection. We validate ReeFRAME on six large-scale anomaly detection datasets, simulating real-time patterns with up to 500,000 agents over two months.


Scalable Vector Search for AI Apps with Milvus and Databricks

#artificialintelligence

Multi-modal embeddings are all the rage these days. Everyone wants a piece of them because they give you a way to convert unstructured data to representations that are useful for understanding the semantic nature of unstructured assets -- across image, text, audio, video, etc. These representations are vectors that can be used for a variety of purposes across use cases which require models for image similarity, deduplication, anomaly detection, text similarity, audio classification, video understanding, etc. To top that off, you don't have to be a data scientist with deep ML expertise to build these systems, nor do you need to have large amounts of data to start leveraging them. This is fine until you run into actual "hands on the keyboard" work for production.


Benchmarking Amazon EMR vs Databricks

#artificialintelligence

At Insider, we use Apache Spark as the primary data processing engine to mine our clients' clickstream data and feed ML-ready data into our machine learning pipelines to enable personalization. We have been using Spark since version 1.5 and are always looking for ways to improve efficiency. If you are interested too, check out our blog post about how Spark 3 reduced our Amazon EMR cost by 40%. To further improve our platform's efficiency, we decided to conduct a trial with the Databricks platform. Before moving on to the Databricks platform and the benchmarks, let's see how we use Apache Spark and Amazon EMR, and look at the pain points, to better understand our current solutions and challenges.


Machine Learning Data Lineage with MLflow and Delta Lake - Databricks

#artificialintelligence

Then we will show a live demo on how to use various versioning features from these two frameworks to achieve data lineage in the machine learning process. We know that machine learning development is complex. To give a sense of it, this is a typical machine learning pipeline. You take your raw data and do some ETL, featurization, or data prep. Then you want to do some training with this data to produce a model and deploy this model to production.